Abstract

Natural  language  analysis  of  patents  holds  promise  for  the  development  of  tools  designed  to  assist  analysts  in  the  monitoring  of  emerging 
technologies.  One  component  of  such  tools  is  the  identification  of  technology  terms.  We  describe  an  approach  to  the  discovery  of 
technology  terms  using  supervised  machine  learning  and  evaluate  its  performance  on  subsets  of  patents  in  three  languages:  English, 
German,  and  Chinese. 
1.  Introduction

The  timely  detection  of  emerging  technologies  and  the 
monitoring  of  their  worldwide  evolution  pose  daunting 
challenges  for  analysts  (PICMET,  2012).  Not  only  do  these 
tasks  demand  constantly  expanding  domain  expertise  but 
the rate of scientific publication is growing fast (Sharma
et al., 2002; Larsen and Ins, 2010).

Patent filings represent a leading indicator of the maturation
of technologies and their introduction into the marketplace.
As semi-structured documents, they offer many opportunities
for data mining of natural language content. For example,
citations and references to prior art reflect the intellectual
development of a technology, while the appearance
of novel terminology in a cluster of patents suggests the
emergence of a new subfield. Previous research on patents
has applied natural language processing for the purpose of
summarization and clustering (Tseng et al., 2007), infringement
analysis (Indukuri et al., 2007), and computer-assisted
categorization (Fall et al., 2003). Numerous techniques for
the automatic extraction of terms and phrases in support of
these tasks have been proposed. However, such efforts have
rarely made a distinction between terms that denote technologies
and other classes of terms. In this paper, we seek
to automate the identification of technology terms within
patents in order to make this constantly growing technical
vocabulary available for the construction of higher-level
analytical tools. This work was developed in the context of
an automated system that processes very large collections
of patents and scientific publications in order to detect and
track scientific emergence within diverse science and technology
communities (Brock et al., 2012; Babko-Malaya
et al., 2013a; Thomas et al., 2013; Babko-Malaya et al.,
2013b).

Our approach to technology term detection follows from
the successful application of supervised learning in information
extraction tasks such as named-entity detection
(Nadeau and Sekine, 2007) and medical concept extraction
from clinical records (Uzuner et al., 2011). The general
methodology involves using a large set of human-annotated
examples of the target class(es) along with their textual
contexts to serve as training examples for generating
a machine-learned model which exploits features extracted
from the labeled terms and their contexts. However, unlike
the well-defined entity types in those domains (e.g.,
company names, geographical locations, medical symptoms
and treatments), the imprecise definition and immense
scope of technical terminology present unique challenges.
Consider, for example, the definitions of "technology"
provided by the American Heritage Science Dictionary
(Kleinedler and Spitz, 2005):

1.  The  use  of  scientific  knowledge  to  solve  practical 
problems,  especially  in  industry  and  commerce. 

2.  The  specific  methods,  materials,  and  devices  used  to 
solve  practical  problems. 

The range of terms that fit the second definition above is
quite broad, running the gamut from esoteric devices like
magnetometers and nanotubes to everyday artifacts like
articles of clothing or furniture. Examples from WIPO's
International Patent Classification1, a large multi-level hierarchy
designed to support the assignment of patents to categories,
follow:

1.  Apparatus for the destruction of unwanted vegetation,
e.g.  weeds  (biocides,  plant  growth  regulators) 

2.  Fittings  or  trimmings  for  hats,  e.g.  hat-bands 

3.  Geodesic  lenses  or  integrated  gratings 

For our purposes, then, we define a technology term
broadly as a lexical phrase denoting an artifact, process, or
field of study (further nuances of this definition are elaborated
below).

Since  technology  development  is  a  global  phenomenon, 
monitoring  the  life  cycle  of  technologies  requires  analysts 
to  track  literature  in  many  languages.  Thus,  it  is  critical  that 
the  methodology  for  technology  term  extraction  generalize 
readily  to  multiple  languages.  To  test  the  generalizability 
of  our  approach,  we  apply  and  evaluate  the  methodology 
on  English,  German  and  Chinese  patents. 

1  http://www.wipo.int/classifications/ipc/en/

The paper is organized as follows. We first provide an
overview of the full system, describing the extraction of
candidate technology terms from text, annotation strategy,
generation of training instances, construction of a technology
term classifier, and use of the trained model to produce
a technology ontology. We then present the results of an
evaluation on a subset of English patents, followed by results
for German and Chinese. We conclude with a discussion
of these findings and opportunities for future work.

2.  System  Description 

New technologies often demand the creation of new sublanguages,
while standardization of a vocabulary over time
tends to indicate the maturing of a new field. Thus, temporal
fluctuations and trends in terminology can assist analysts
in their detection and assessment of technology emergence,
especially when used in conjunction with other
actor-network indicators (Latour et al., 2010). Our goal is
the construction of a comprehensive and extensible lexical
ontology of technical terms that can serve the needs of
text-based analytical tools across multiple languages.

Given the vast number of artifacts and processes described
in patents, we opted for a supervised machine learning
approach to technical term detection. The feasibility of this
approach depends upon both the existence of discriminative
contextual features and sufficient training data to enable
appropriate feature weights to be learned from examples. To
simplify the task, we preprocessed the text using shallow
linguistic processing rules to select candidate words and
noun phrases; then supervised machine learning was employed
to classify these candidates as technology terms or
not. The diagram in Figure 1 presents the overall architecture
of the system.

2.1.  Pre-processing  and  candidate  selection 

The patent data used for building the system consisted of
small collections of XML-formatted patents randomly selected
from LexisNexis' English, German, and Chinese
patent databases. Each subset contained 500 documents
and spanned the years between 1980 and 2012. Each patent
was parsed with respect to its XML document structure to
identify relevant sections (title, abstract, first claim, background,
etc.). Then the Stanford tagger2 was run over
the text to detect sentence boundaries, extract tokens (a task
requiring word segmentation in Chinese) and assign each
token a part-of-speech tag.

2  http://nlp.stanford.edu/software/tagger.shtml

Next, a language-specific chunker was used to scan token
sequences greedily for the longest sequences matching
simple noun phrase patterns. In English, most candidate
phrases are of the form (ADJ? N* N). Each part-of-speech
tag in a pattern may have an associated list of noise words
that are to be excluded from the matched patterns. These
serve primarily to eliminate many non-substantive modifiers
from the greedy phrase matcher. For example, the
leading adjectives "first", "specific", or "following" would
be considered noise words and excluded from any matching
candidate phrase, while substantive adjective modifiers
like "electronic" or "radioactive" would be retained. The output
of the chunker is a list of candidate noun phrases along
with associated sets of contextual features (e.g., surrounding
words and n-grams) which serve as features for machine
learning. Similar chunking rules perform the equivalent
function in German and Chinese.
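The greedy matching of the (ADJ? N* N) pattern can be illustrated with a minimal sketch. It assumes Penn Treebank style tags (JJ* for adjectives, NN* for nouns) and a noise-word list containing only the three adjectives mentioned above; the function name and data layout are hypothetical and are not taken from the system described here.

```python
from typing import List, Tuple

# Assumed noise-word list: only the three examples named in the text.
NOISE_ADJECTIVES = {"first", "specific", "following"}


def chunk_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Greedily collect phrases matching (ADJ? N* N) from (token, tag) pairs."""
    candidates = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        words = []
        # Optional single leading adjective; noise adjectives are dropped.
        if tagged[i][1].startswith("JJ"):
            if tagged[i][0].lower() not in NOISE_ADJECTIVES:
                words.append(tagged[i][0])
            i += 1
        # Longest possible run of nouns following the optional adjective.
        noun_start = i
        while i < n and tagged[i][1].startswith("NN"):
            words.append(tagged[i][0])
            i += 1
        if i > noun_start:
            candidates.append(" ".join(words))
        elif i == start:
            # Neither adjective nor noun: skip this token.
            i += 1
    return candidates


sentence = [("the", "DT"), ("first", "JJ"), ("electronic", "JJ"),
            ("control", "NN"), ("unit", "NN"), ("emits", "VBZ"),
            ("radioactive", "JJ"), ("particles", "NNS")]
print(chunk_candidates(sentence))
```

In this sketch an adjective that is not followed by a noun is simply discarded, and a noise adjective directly before a noun run is removed from the phrase rather than blocking the match.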

2.2.  Manual  annotation  of  terms 

Supervised learning requires a gold set of manually annotated
instances that label terms according to a set of predefined
classification criteria. For the purposes of annotating
technologies, we defined a technology term as a phrase
matching any of the following criteria:

•  Artifact - a man-made object produced as the result of
a scientific manufacturing process (e.g., electron
microscope, computer keyboard)

•  Process/technique - the name of a method or process
for creating an artifact or doing technical work (e.g.,
duty cycle control, electron microscopy)

•  Field - the name of a discipline or scientific area relating
to the production of artifacts or processing (e.g.,
biotechnology, construction engineering)

In some cases, interpreting phrases using these criteria
alone proved problematic. For example, many natural kinds
are produced by artificial means, such as smooth muscle
cells produced by cell culture or an amino acid sequence
determined by protein sequencing. In the context of patents,
these typically function as artifacts and hence technology
terms. There are some candidate noun phrases which include
appositive terms, as in "clock pulse CK" or "clock
pulse cp1". Since "CK" is a generic way to abbreviate
"clock pulse", the former phrase was considered a technology
term whereas the latter, referring to an instance within
the patent, was not. A patent typically makes many references
to components of an artifact, as in "resist-free back
side", "rear cross frame member", and "parent identifier
field". Unless these terms refer to components that can reasonably
be thought of as independent artifacts, they were
not to be considered as denoting technology terms. Also
problematic are broad terms which may refer to a technology
but in an underspecified manner, such as "data" or
"circuits".

In order to reduce the effort required for manual annotation
and to maximize its effectiveness for training, we made the
simplifying assumption that each phrase (i.e., term "type")
need only be labeled once, even though some phrase instances
might serve different functions in different patents.
This simplification relieved the annotator of labeling multiple
instances of the same term, a task which would have required
considerable work, inspecting each context in which
each term appeared within each patent. Instead, the annotator
labeled each term within the broader "context" of technology
patents as a whole, deciding based on his/her understanding
of a term whether a use of the term would most
likely denote a technology. Assigning a label often required
the annotator to do a web search to understand the meaning
of unfamiliar candidate phrases. (A search for the quoted
phrase, sometimes ANDed with the term "technology" or
"definition" or both, usually produced enough information
in the result set snippets to make a decision.) This approach
to constructing a training set is a form of "distant supervision"
(Mintz et al., 2009) and runs the risk of introducing
noise. For example, some terms, such as generic single-word
terms that have several distinct meanings or phrases
that may refer to both a natural kind and an artifact, are
particularly difficult to classify and indeed may not have a
single dominant interpretation in the corpus. Rather than
force a decision, we gave the annotator the additional option
of labeling a term "?" whenever the annotator lacked
the confidence to choose a single classification for the term
out of context. Such labeled terms were not included in the
gold set for training the model.

Figure 1: System Diagram

Candidate terms for annotation were generated using the
output of the chunker and sorted by document frequency
so that more common terms were labeled first. More frequently
occurring terms would be expected to generate
more training instances when applied to the corpus. For
each language, annotators provided a minimum of 2000 labeled
terms; for English, extra terms were annotated, resulting
in a set of 3784 labeled terms. The overall agreement
between the annotators, using Cohen's Kappa, was 0.52,
suggesting moderate agreement. The annotators were not
experts in the technical areas of the patents.
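For readers unfamiliar with the agreement statistic, the following sketch shows the standard Cohen's Kappa computation for two annotators labeling the same terms. The function name and the toy labels are illustrative assumptions; the annotation data behind the reported value of 0.52 is not reproduced here.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's Kappa: (observed - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty term list")
    n = len(labels_a)
    # Fraction of terms on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example with "y"/"n" technology labels (not the paper's data).
annotator_1 = ["y", "y", "n", "n"]
annotator_2 = ["y", "n", "n", "n"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement two annotators would reach by chance given how often each uses every label, which matters here because negative labels are far more common than positive ones.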

2.3.  Features 

To create training instances from the labeled terms, each
term and label were combined with the contextual features
associated with occurrences of the term found within the
document collection. Features fell into the following categories:

•  External local context: n-grams of size 1, 2, and 3 to
the left and right of the term

•  External syntactic context: rule-based "dependency"
relationships between the term and preceding nouns,
verbs and adjectives (prev_V, prev_Npr, prev_Jpr,
prev_J). These were intended to capture, for example,
the verb (and any prepositions/articles) for
which the term is the object. prev_Npr captures
a dominating head noun and preposition (e.g., the
phrase "a large reduction in the cpu speed" would
generate the feature prev_Npr=reduction_in for the
term "cpu speed", whereas the n-gram context would
create the features prev_n1=the, prev_n2=in_the,
prev_n3=reduction_in_the).

•  Internal features: these include the number of tokens in
the phrase, first_word, last_word, and suffixes of length
3, 4, and 5 characters.

•  Document location features: the term's location within the
structure of the patent, broken down by "1st sentence"
and "later sentence" within title, abstract, summary,
description, and first claim.
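The n-gram and internal feature categories above can be sketched as follows. This is an illustration under assumptions: the feature names prev_n1, prev_n2, prev_n3, plen, first_word, last_word and suffix4 follow the names used in this paper, while the next_n* names, the choice to take suffixes from the last word, and the function itself are hypothetical. The rule-based dependency features and the location features are omitted.

```python
from typing import List, Set


def extract_features(tokens: List[str], start: int, end: int) -> Set[str]:
    """Features for the candidate term tokens[start:end] within one sentence."""
    term = tokens[start:end]
    feats = {
        f"plen={len(term)}",           # number of tokens in the phrase
        f"first_word={term[0]}",
        f"last_word={term[-1]}",
    }
    # Suffixes of 3, 4 and 5 characters (assumed to come from the last word).
    for k in (3, 4, 5):
        if len(term[-1]) >= k:
            feats.add(f"suffix{k}={term[-1][-k:]}")
    # N-grams of size 1 to 3 immediately left and right of the term.
    for n in (1, 2, 3):
        if start - n >= 0:
            feats.add(f"prev_n{n}=" + "_".join(tokens[start - n:start]))
        if end + n <= len(tokens):
            feats.add(f"next_n{n}=" + "_".join(tokens[end:end + n]))
    return feats


tokens = "a large reduction in the cpu speed".split()
features = extract_features(tokens, 5, 7)   # the term "cpu speed"
print(sorted(features))
```

For the example phrase, the left-context features produced are prev_n1=the, prev_n2=in_the and prev_n3=reduction_in_the, matching the example given in the feature list.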

Table 1 shows the total number of potential training instances
produced for the 500-document collections in three
languages, as well as the percentages of them covered by
the most frequent N labeled types. The numbers suggest
that a relatively minor annotation effort can generate a significant
number of training instances. We will discuss the
number of positive and negative examples again in a later
section.


           instances    100   1000   2000   5000
English      237,960    10%    29%    36%    48%
Chinese      133,921    21%    49%    60%    75%
German        87,469    20%    50%    61%    77%

Table 1: Share of N most frequent candidate terms


Since the same term can appear multiple times within a single
document, there are several approaches to generating
training instances for a classifier. We could treat each single
term occurrence as a separate instance for training or
else merge features from multiple occurrences within a single
patent into a single feature vector. While we plan to
compare both approaches in future work, for this study we
opted for the latter approach, as it allows for a model to be
trained directly on the conjunction of features found within
each document. Multiple occurrences of the same feature
were collapsed into a single feature, rather than counted or
weighted. The output of this step, then, was a list of binary
feature vectors, one for each term (type) within a document.
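The merging step can be sketched as a set union keyed by document and term. The data layout (tuples of document id, term, and a feature set per occurrence) and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# One occurrence: (document id, term, features observed at that occurrence).
Occurrence = Tuple[str, str, Set[str]]


def merge_occurrences(
    occurrences: Iterable[Occurrence],
) -> Dict[Tuple[str, str], Set[str]]:
    """Collapse all occurrences of a term in a document into one binary vector.

    A set records only presence or absence, so a feature seen several
    times in the same document is neither counted nor weighted.
    """
    merged: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for doc_id, term, feats in occurrences:
        merged[(doc_id, term)] |= feats
    return dict(merged)


occurrences = [
    ("US1", "cpu speed", {"prev_n1=the", "last_word=speed"}),
    ("US1", "cpu speed", {"prev_n1=the", "prev_n1=a", "last_word=speed"}),
    ("US2", "cpu speed", {"prev_n1=of", "last_word=speed"}),
]
vectors = merge_occurrences(occurrences)
print(vectors[("US1", "cpu speed")])
```

The result has one entry per term type per document, so the two occurrences in document US1 yield a single training instance.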
2.4.  Classification

We used the training data from each language collection
to train a maximum entropy classifier using the MALLET
toolkit (McCallum, 2002). The resulting models can be applied
to our task in two different ways. A model can be
used dynamically to detect technology terms in a new unseen
patent. Alternatively, a model can be applied in batch
mode to a large collection to create a global ontology of
technology terms. In this mode, the category scores for
the same term across multiple documents are merged into
a single statistic (e.g., by computing their average, min or
max scores). This approach allows scoring for each term
to be based on a larger sample of patents, which may lead
to more reliable categorization. Building a global ontology
off-line also allows for terminology detection in new
patents to be done simply and efficiently using dictionary
lookup. However, this approach risks lower recall as the
global ontology lacks knowledge of any previously unseen
terms. A hybrid approach, in which classification scores
are dynamically computed for all candidate terms in a new
document while global ontology scores are used to bias
decisions about previously seen terms, may offer the best
solution by combining local (document) and global (collection)
information. Since the MALLET classifier output includes
probability scores for each class, it is possible to set
arbitrary thresholds for accepting technology terms based
on desired levels of precision and recall.
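The batch mode described above can be sketched as follows: per-document probability scores for each term are merged with a chosen statistic and compared against an acceptance threshold. The function name, the input layout, and the default threshold of 0.5 are assumptions for illustration; the classifier that produces the scores is not shown.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple


def build_ontology(
    doc_scores: Iterable[Tuple[str, float]],
    combine: Callable[[Sequence[float]], float] = mean,
    threshold: float = 0.5,
) -> Set[str]:
    """Merge per-document scores for each term and keep terms above threshold.

    doc_scores holds one (term, probability of being a technology) pair for
    each document in which the term occurs; combine may be mean, min or max.
    """
    by_term: Dict[str, List[float]] = defaultdict(list)
    for term, score in doc_scores:
        by_term[term].append(score)
    return {term for term, scores in by_term.items()
            if combine(scores) >= threshold}


scores = [("graphics processor", 0.9), ("graphics processor", 0.7),
          ("device", 0.8), ("device", 0.1)]
print(build_ontology(scores))                 # average score per term
print(build_ontology(scores, combine=max))    # most optimistic score per term
```

Using max favors recall, since one confident document is enough to admit a term, whereas min favors precision. Once built, the ontology supports detection in new patents by simple set membership lookup.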

3.  Results  and  Discussion 

To evaluate our system, we divided a randomly selected
500-document English collection into a training set of 490
patents and a test set of the remaining 10 patents. Over
3700 candidate phrases from the training collection and
nearly 1500 from the test set were annotated with "y" or
"n" labels. Any terms appearing in the test ("gold") set
were subsequently removed from the training set so that
the two labeled term sets were disjoint. A maximum entropy
classifier was trained on labeled instances from the
training collection. The model thus created (named Model
M1) was used to generate probability scores for the test set
terms. Using the gold set labels, precision, recall and f-score
were computed for the system-generated results at the
acceptance threshold of 0.5. The results are shown below
in Table 2.


        P      R      F1
M1     0.63   0.57   0.60

Table 2: Precision, recall and f-score
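As a reminder of how these metrics follow from scored terms and gold labels, the sketch below computes precision, recall and f-score at an acceptance threshold. The function name and the four-term example are illustrative assumptions and do not reproduce the evaluation set.

```python
from typing import Dict, Tuple


def precision_recall_f1(
    scores: Dict[str, float],
    gold: Dict[str, str],
    threshold: float = 0.5,
) -> Tuple[float, float, float]:
    """Score system output against gold "y"/"n" labels at a threshold."""
    tp = fp = fn = 0
    for term, label in gold.items():
        # A term without a system score is treated as rejected.
        accepted = scores.get(term, 0.0) >= threshold
        if accepted and label == "y":
            tp += 1
        elif accepted:
            fp += 1
        elif label == "y":
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


system_scores = {"graphics processor": 0.988578, "device": 0.892514,
                 "cpu": 0.017968, "interior": 0.0}
gold_labels = {"graphics processor": "y", "device": "n",
               "cpu": "y", "interior": "n"}
print(precision_recall_f1(system_scores, gold_labels))
```

Raising the threshold generally trades recall for precision, which is how the probability scores allow the operating point to be tuned.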

We examined high and low scoring terms within the evaluation
set to better understand the nature of the false positives
and false negatives (Table 3). Among the highest-scoring
terms for which the manual (gold) annotation
was negative we find some generic artifact terms ("device",
"identifier") which may, under the circumstances,
have qualified as artifacts. This exemplifies the difficulty
of annotating terms for the purpose of classifying artifacts.
There is a large class of highly specialized unambiguous
terms (such as the true positives shown in the table). At
the same time, there is a large class of common terms for
which the "correct label" is less well-defined. To some extent,
these terms are not particularly interesting, given that
analysts will be interested only in the specialized terms, not
the general ones. However, labeled general terms in the
training data and in the evaluation set will affect both the
actual performance and the measured performance of the
system. Similar issues arise for some of the negatively labeled
terms: "storage system unit" and "long extended conductor
device" are arguably "descriptions" of artifacts rather than
terms directly denoting artifacts, but nonetheless the labels
used for training purposes could have a direct impact on the
effectiveness of training data, given that the contextual
features for artifact descriptions are likely to be the same
as for artifact terms. This suggests a need for further
refinement of our annotation guidelines, particularly
concerning the proper labeling of generic terms and
descriptive phrases.

Low scoring terms with positive gold labels (false negatives)
include many single-word terms that are unambiguously
artifacts: "database", "cpu" and "solvents". While it
is possible that their roles in the particular patents used for
evaluation may have been minor enough to lack sufficient
contextual clues to identify them as such, their scores are
more likely a symptom related to the class of single-word
terms.
gold   score      term

y      0.988578   graphics processor
y      0.986901   communications system
y                 computer vision system
y                 luminescent nanoparticles
y      0.981159   spatial analysis

n                 long extended conductor device
n      0.892993   coronary artery
n      0.892514   device
n      0.892496   light source
n      0.880899   identifier

n      0.000000   lowered position
n      0.000000   interior
n      0.000000   hook-like part
n      0.000000   highest position
n      0.000000   guide walls

y      0.026642   algorithm
y      0.017968   cpu
y      0.017956   solvents
y      0.017776   pixels
y      0.014474   polymerization

Table 3: High and low scoring terms with their gold labels.
Groupings capture true positives, false positives, true negatives,
and false negatives, respectively. The table shows the
gold label, the system score and the term.

Such observations raised a number of questions about our
system design, ranging from the efficacy of specific feature
types to the consequences of the distant supervision
approach. In particular, we were interested in the following
questions:

•  Since we are using a large set of labeled "seed terms"
to create training instances through distant supervision
rather than annotating each term in context, how is
performance affected by the mix of tokens and types
appearing in the generated training instances? As the
size of the training instance set generated from the
seed terms grows, more frequently occurring labeled
terms may gain greater representation in the training
set. However, the most frequently occurring terms are
also the terms most likely to have ambiguous interpretations,
which could introduce noise into the training
data. Would there be any benefit to setting thresholds
for the contributions of frequent types when building
the training data?

•  What  is  the  relative  importance  of  external  contextual 
features  vs.  internal  information  about  the  term  itself 
(e.g.,  head  word  and  suffix  features)? 

•  Given the apparent importance of term internal information
(head words and suffixes) for classifying
phrases and the fact that the vast majority of terms are
multiword phrases, how are single-word terms (that
lack these clues) impacted? Would it be more appropriate
to train separate models for single words and
phrases?

•  Training instances are constructed by joining in a single
vector all features related to all occurrences of a
term within a document. Would there be an advantage
to weighting the feature vector by feature occurrence
counts, vs. treating it as a binary (presence/absence)
vector?

•  Are  a  term’s  locations  within  a  patent  related  to  its 
likelihood  to  be  an  artifact?  What  is  the  contribution 
of  including  location  information  as  features? 

•  Are the n-gram features preceding the term redundant
with, or more or less important than, the dependency-based
features? Do both sets of features make independent
contributions to the performance?

We conducted experiments to investigate some of these
questions. Regarding the issue of transfer of labeled terms
from one patent collection to another, we had focused our
annotation effort on labeling the most frequent terms in our
source collection in order to maximize transfer. However,
patents contain many rare and specialized terms and a significant
overlap of terms from one set to another, especially
across domains, is not guaranteed. To test the effect of training
using a set of patents different from those from which
our original annotations were drawn, we randomly assembled
a different collection of 500 patents, generated training
instances from it and tested the resulting model on our evaluation
data. The original model M1 had 3,808 positive instances
and 40,589 negative instances, distributed over 1,949
positive types and 1,778 negative types. Building the new
model M2 resulted in 2,880 positive instances and 37,480
negative instances, distributed over 389 positive types and
1,070 negative types. The results are shown in Table 4. As
expected, there is a drop in performance, due, most likely,
to the decrease in the number of training types generated
from this collection.


        P      R      F1
M1     0.63   0.57   0.60
M2     0.59   0.55   0.57

Table 4: Precision, recall and f-score for two models of the
same size

In an attempt to overcome the performance deficit, we experimented
with enlarging the patent collections used as
a source of training instances, noting the number of term
tokens and types that appeared in the training data as the
source collection size was increased. This resulted in a
new model M3 with an optimal size of 10,000 documents,
which yielded 58,306 positive instances and 755,156 negative
instances, distributed over 689 positive types and 1,437
negative types (which is still significantly fewer than in our
original model). Table 5 shows that the larger model does
not help increase the precision over the smaller models M1
and M2, but that recall increases significantly. Creating
models over 20,000 and 50,000 patents showed no increase
in precision or recall.


        P      R      F1
M1     0.63   0.57   0.60
M3     0.57   0.77   0.65

Table 5: Increasing the size of the model

We hypothesized that the large numbers of instances associated
with a few frequent terms may adversely affect the
results, especially for those cases where it is not very clear
whether a term is a technology or not. To investigate this,
we performed two experiments: (1) revising the training
gold data of labeled terms and throwing out some of the
more unclear frequent terms, and (2) taking a much larger
training set of over 350,000 patents and downsampling the
number of instances per term to a maximum of 1000. The
first experiment showed some promise with small training
sets, but the effects tailed off for larger training sets and
there was no configuration that displayed the same performance
as Model M3. The second experiment resulted in a
slightly higher F-score of 0.66.
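The per-term cap used in the second experiment can be sketched as below. The text does not state how instances were selected when a term exceeded the cap, so uniform random sampling is an assumption here, as are the function name and the tuple layout with the term in first position.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cap_instances_per_term(
    instances: Sequence[Tuple],
    max_per_term: int = 1000,
    seed: int = 0,
) -> List[Tuple]:
    """Keep at most max_per_term training instances for each term.

    Each instance is a tuple whose first element is the term. Terms over
    the cap are reduced by uniform random sampling (an assumption).
    """
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    by_term: Dict[str, List[Tuple]] = defaultdict(list)
    for inst in instances:
        by_term[inst[0]].append(inst)
    capped: List[Tuple] = []
    for group in by_term.values():
        if len(group) > max_per_term:
            group = rng.sample(group, max_per_term)
        capped.extend(group)
    return capped


# Toy data: a frequent generic term and a rare specialized one.
toy = [("device", f"doc{i}") for i in range(5)] + [("nanotube", "doc0")]
capped = cap_instances_per_term(toy, max_per_term=2)
print(len(capped))
```

Capping limits the influence that a handful of frequent and possibly mislabeled terms can have on the learned feature weights, while rare terms keep all of their instances.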

To gauge the contribution of internal and external features
we took the instances as used for model M3 and built models
with only internal features (M4) and only external features
(M5). Table 6 shows that the overall results are dominated
by internal features. Using external features gives a
high precision but an extremely low recall. This seems to
suggest that technologies in general are not characterized
by their linguistic context.


        P      R      F1
M3     0.57   0.77   0.65
M4     0.55   0.77   0.64
M5     0.73   0.04   0.08

Table 6: Internal and external features

We also looked at the impact on the f-score when removing
each of the features individually. Most features, when
taken out in isolation, did not have much impact on the
score. The most notable exception was the last_word
feature, whose removal reduced the f-score by 0.09. Removing
the phrase length feature plen or the suffix4 feature
reduced the f-score by 0.02 in each case. Note that these are
all internal features.

The difference in performance between single-token terms
and multi-token terms is shown in Table 7 below. The system
labels were created with model M3, but evaluation was
partitioned according to the single-token versus multi-token
distinction.


                        P      R      F1
all terms              0.41   0.69   0.52
single-token terms     0.20   0.08   0.09
multi-token terms      0.42   0.80   0.55

Table 7: Performance on single-token terms and multi-token
terms

Note that the numbers in the "all terms" row are not the
same as the numbers for model M3 as reported before. This
is because the basic evaluation set was too small to allow
for meaningful metrics for the single-token terms. We increased
the size of the evaluation set, but have not yet performed
quality control on this new set. Initial inspection
showed a larger percentage of annotation errors than in the
basic set, which is probably the reason that precision and
recall are lower.

What  jumps  out  is  the  very  low  recall  for  single-token 
terms.  We  have  not  yet  determined  what  exactly  is  at  the 
core  of  this. 

Comparing the results for classifiers trained on different
training sets, we note that precision is highest when the
coverage of different terms (types) in the training data is
highest (Table 2). Recall appears to benefit more than
precision from training sets which include more instances of the
same terms. These additional instances provide new
contextual features which increase opportunities for
generalization. However, the bulk of these additional contexts may
be coming from a relatively small set of common patent
terms. If even a small number of these common terms are
labeled incorrectly in the gold data (or else have multiple
interpretations and should not have been assigned a y/n
label), these could have an increasingly negative effect as the
number of training instances containing them grows. This
may account for the slight dip in precision for the larger
training set sizes. One way to correct for this might be to
limit the number of instances used for any one term so that
the contribution to feature weights in the learned model is
spread more evenly among different labeled terms.
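The proposed correction, capping the number of training instances per term, could look roughly like the sketch below. It was not part of the experiments reported here. The instance layout (a term paired with a feature dictionary), the function name, and the use of random sampling rather than, say, keeping the first instances seen are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed layout: (term, feature dict for one occurrence of the term).
Instance = Tuple[str, dict]


def cap_instances_per_term(
    instances: Iterable[Instance], max_per_term: int, seed: int = 0
) -> List[Instance]:
    """Keep at most max_per_term randomly chosen instances of each term."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_term: Dict[str, List[Instance]] = defaultdict(list)
    for instance in instances:
        by_term[instance[0]].append(instance)

    capped: List[Instance] = []
    for term in sorted(by_term):
        group = by_term[term]
        if len(group) > max_per_term:
            group = rng.sample(group, max_per_term)
        capped.extend(group)
    return capped


# Toy data: one very frequent term and one rare term.
toy = [("method", {"i": i}) for i in range(50)]
toy += [("optical disk", {"i": 0}), ("optical disk", {"i": 1})]
balanced = cap_instances_per_term(toy, max_per_term=5)
```

With a cap of 5, the frequent term contributes 5 instances instead of 50, while the rare term keeps both of its instances.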

The growth rate of instances relative to term types as the
number of documents in the training set increases suggests
that getting sufficient coverage of rare terms in the training
data may require very large document sets. Nevertheless,
the precision/recall performance for the initial training set,
which contains instances of 1033 positive terms and 1407
negative terms, is very encouraging and suggests that
increasing the coverage of rare terms in the training set could
lead to further improvements in performance.


4.  Multilingual  Processing 

The overall process was essentially the same for Chinese
and German, although each language presented several
problems of its own. The document structure parser needed
some language-specific declarations to deal with useful
section headers in Chinese like technical field and background
art. German patents on the other hand had little overt
document structure.

Because Chinese does not separate its words using white
space, a word segmentation step was required prior to
part-of-speech tagging. This was accomplished using a Chinese
word segmenter included with the Stanford University
language processing toolkit. We used this same toolkit for
sentence splitting and part-of-speech tagging for all languages.
Patterns for chunking tagged words into candidate phrases
had to be constructed for each language. Most
contextual feature definitions were sharable among the three
languages, with small variations due to syntactic differences.
The main time investment in moving to Chinese or German
was in the manual annotation. For comparison, we
annotated 2000 terms in each of the three languages.
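To illustrate what a chunking pattern over tagged words might look like, the sketch below collects maximal runs of adjectives and nouns that end in a noun. The actual per-language patterns are not given in this paper, so the tag names (Penn Treebank style, as used for English), the pattern itself, and the function name are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def chunk_candidates(
    tagged: Iterable[Tuple[str, str]],
    allowed: Sequence[str] = ("JJ", "NN", "NNS"),
    heads: Sequence[str] = ("NN", "NNS"),
) -> List[str]:
    """Return maximal runs of allowed tags, trimmed so each ends in a head tag."""
    phrases: List[str] = []
    run: List[Tuple[str, str]] = []
    # A sentinel with a disallowed tag flushes the final run.
    for word, tag in list(tagged) + [("", "<END>")]:
        if tag in allowed:
            run.append((word, tag))
            continue
        # Drop trailing non-head words (e.g. a dangling adjective).
        while run and run[-1][1] not in heads:
            run.pop()
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


# Toy tagged sentence, invented for illustration.
sentence = [
    ("the", "DT"), ("optical", "JJ"), ("disk", "NN"), ("drive", "NN"),
    ("is", "VBZ"), ("fast", "JJ"),
]
candidates = chunk_candidates(sentence)
```

Adapting such a pattern to another language would mainly mean swapping in that language's tag set and head-position convention.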

Abstracting away from the effort to add a segmenter, the
time needed to add Chinese and German versions of the
language-specific components was very similar. In both
cases it took a computational linguist about a week to adapt
the document structure component, integrate the
part-of-speech tagger, write chunker rules, define and adapt
feature extraction rules, and manually annotate terms. An
additional day was needed to prepare the evaluation gold
standard.

4.1.  Multilingual  Evaluation 

Manual annotation occurred in two phases. In a first phase,
which was done for English, Chinese and German, we took
the 2000 most frequent technology candidate terms from
a training set and associated these manually with 'y' and
'n' labels. There was some revision of guidelines and
re-annotation, but the focus was on quickly generating labeled
instances. In a second phase, which we did for English
only, annotation guidelines were given a closer look and
a new label '?' was introduced which allowed annotators
to mark terms that should not be used to generate positive
or negative instances. Consequently, the English
annotation was completely revised. In addition, extra terms were
added to the English term list. In this section, we compare
an older version of the English system to the Chinese and
German systems; hence, the English results do not match
those reported earlier in the paper. The multilingual results
are presented in Table 8.


            P      R      F1
English    0.67   0.44   0.53
Chinese    0.52   0.21   0.30
German     0.85   0.36   0.56

Table 8: Precision, recall and f-score for English, Chinese
and German

The Chinese system has better precision than the English
system at the higher MaxEnt thresholds (not pictured in the
table), but recall and f-score lag English scores consistently
by a large margin. The lower recall may partially be
attributable to a lower number of positive training instances
(1286 versus 2496). The German system however has
access to a similar number of positive labels as the Chinese
system, yet has recall at the level of the English system. We
have not yet explained this anomaly. Even more remarkable
is the extremely high precision of the German system. This
is most likely at least in part the result of a statistical fluke.
The German evaluation set turned out to have many fewer
terms than the English one (552 versus 1436) and the
numbers in Table 8 are based on small numbers of true and false
positives.
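One way to see how much a precision estimate can move when it rests on few true and false positives is a binomial confidence interval. The sketch below uses the Wilson score interval, which is a swapped-in illustration rather than an analysis performed in this study. The counts are hypothetical and chosen only so that both cases have an observed precision of 0.85; the actual German counts are not reported here.

```python
import math
from typing import Tuple


def wilson_interval(
    successes: int, trials: int, z: float = 1.96
) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return centre - half, centre + half


# Hypothetical counts: true positives out of all predicted positives.
small_sample = wilson_interval(17, 20)     # observed precision 0.85
large_sample = wilson_interval(170, 200)   # observed precision 0.85
```

The interval for the smaller sample is several times wider, which is the sense in which a high precision on few predictions should be read with caution.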

The generally lower number of positive and negative
training samples for Chinese and German can be explained by
the size of the datasets. The 500 English patents comprise
3.7 million tokens whereas the 500 Chinese and 500
German patents contain 1.7 million and 1.3 million tokens
respectively.

5.  Conclusions 

The identification of technology terms within a collection
of patents is a challenging information extraction task due
to the nature of technology terms themselves, which may
be ambiguous or generic and have multiple nuances of
interpretation. Initial results using a supervised learning
approach are nonetheless very promising and appear to be
readily extensible to multiple languages. Our study points
to a number of areas for future work, including further
refinements to our annotation guidelines and annotation
strategy, a better understanding of the relative contributions of
additional training terms vs. additional term instances, and
the development of strategies for combining term scores
from multiple documents. We also plan to compare
alternative approaches for the construction of training instances.